volution problem applies to each subpool separately;
however, the diversity optimization applies across the
library as a whole. A library design strategy must
therefore be able to suggest pools within each of which
there is the minimum product and substituent molecular
weight redundancy. At the same time the diversity
of the library should be optimized over all the pools.

It is important to note that the design requires the
optimization of the whole library, not selection of
individual library compounds, and thus requires the
consideration, and comparison, of all possible libraries
rather than all possible library compounds. In a combinatorial
synthesis it is, in any case, impossible to select
individual combinations of precursors without making
all other products containing those precursors.

Search Space Size. The number of precursors that are
suitable for use in constructing a combinatorial mixture
library is often larger than the number that can reasonably
be used. Hence, typical mixture library design
problems have huge search spaces, the sizes of which
are best illustrated by two of the examples to which we
have applied our method. The design of these libraries
is discussed in the results section of this paper.

In library I, two sites of diversity were available. A
reaction scheme for the production of this library is
given in Scheme 1. There were 360 commercially
available precursors compatible with the chemistry for
R1 and 259 for R2. There were therefore 93 240 possible
library product compounds. The design of this library
called for the production of 10 000 compounds by combining
100 R1s with 100 R2s.
The number of ways of selecting k objects from n is

$$\frac{n!}{(n-k)!\,k!}$$

and so the number of possible libraries is

$${}^{360}C_{100} \times {}^{259}C_{100} = 2.5 \times 10^{164}$$

Scheme 1. Reaction scheme for library I (structures not reproduced).(a)

(a) This is partly based on the scheme of Mayer et al.

To put this number into perspective, the age of the
universe is approximately 10^[exponent illegible] s and its size 10^[exponent illegible] Å^3.
Library II, for which a reaction scheme is shown in
Scheme 2, had three sites of diversity on a small
molecule core. In the existing inventory of precursors
available within Abbott at the time at which the library
was to be constructed, there were 53 suitable candidates
for R1 and R3 and 42 for R2. No fixed size was set for
this library. Instead, a maximum allowed redundancy
at any one molecular weight was specified and the
largest possible size library required. In this case,
therefore, all combinations of picking between 1 and 53
compounds for R1, while picking between 1 and 42 for
R2 and 1 and 53 for R3, are allowed, giving approximately
10^44 possible libraries each containing
between 1 and 117 978 compounds.
The problem size makes it impossible to tackle by
enumeration. For example, if 100 potential libraries
could be evaluated for their mass redundancy and
diversity in one CPU second, an exhaustive enumeration
of all possible solutions for library I would require
on the order of 10^154 years.
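
The orders of magnitude quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, used here purely for illustration; the function names are hypothetical and are not part of the program described in this paper.

```python
import math

def fixed_size_libraries(available, chosen):
    """Ways to pick chosen[i] precursors out of available[i] at every site."""
    total = 1
    for n, k in zip(available, chosen):
        total *= math.comb(n, k)
    return total

def variable_size_libraries(available):
    """Ways to pick any non-empty subset of precursors at every site."""
    total = 1
    for n in available:
        total *= 2 ** n - 1
    return total

# Library I: 100 of 360 precursors at R1 and 100 of 259 at R2.
library_1 = fixed_size_libraries([360, 259], [100, 100])
# Library II: between 1 and all precursors at each of three sites.
library_2 = variable_size_libraries([53, 42, 53])

print(f"library I : about 10^{math.floor(math.log10(library_1))} libraries")
print(f"library II: about 10^{math.floor(math.log10(library_2))} libraries")
print(f"largest library II: {53 * 42 * 53} compounds")
```

Python integers have arbitrary precision, so the counts are exact; only their orders of magnitude are printed.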
In summary, library design is a complex problem
requiring the optimization of a number of often competing
factors, over a vast search space. Genetic algorithms
have been successfully applied to a wide range of such
problems in both chemical and nonchemical domains.
A genetic algorithm is a computational technique that
mimics the processes of Darwinian evolution.17-19 A
potential solution to a problem is encoded in a representation
termed a chromosome. This is typically a
string of bits, integers, real numbers, or symbols, each
of which is termed a gene. The genetic algorithm
operates on a population of these chromosomes that are
generated by assigning values to the genes in the
chromosomes, often at random. A fitness function
measures how well adapted each chromosome is to its
environment. In our example this would equate to how
diverse the library represented by the chromosome is
and how easy it would be to deconvolute. Once an initial
parent population is generated, it is subjected to some
evolutionary processes to breed a new child population,
with the better adapted parents being allowed a greater
chance to produce children. As with natural evolution,
over successive generations the members of the population
become better adapted to their environment, and
thus better solutions to the problem in hand are
discovered.

Figure 1. Chromosome encoding. The chromosome encodes
a representation of a library in a bit string. Each gene (bit)
corresponds to one precursor and is set to 1 if that precursor
is to be used in that library. Knowing the positions of the R
group boundaries allows the chromosome to be decoded to give
the library product combinations as shown.

There are two prerequisites to being able to apply a
genetic algorithm to a problem. The first is to be able 
to choose a representation that allows every possible 
solution to the problem to be encoded in a chromosome. 
The second is that it must be possible to write a fitness 
function to decode the chromosome and produce a score 
that reflects the quality of that solution. 

GALOPED - Genetic Algorithm for Library
Optimization for Efficient Deconvolution

We have developed a program, GALOPED, that applies a 
genetic algorithm to the library design problem. The program 
is implemented in C for Silicon Graphics UNIX workstations 
using the SUGAL package. 20 SUGAL implements a wide 
range of different methods for each process in a genetic 
algorithm and provides a mechanism for these to be optimized 
in combinations. It requires the user to provide C code for 
the fitness function, data entry, and results output and for 
any nonstandard genetic algorithm operators to be used. The 
package includes a MOTIF interface to allow parameter setting 
and monitoring of the progress of the program. The interface 
was extended to create windows for browsing the frequency 
distributions and the chemical structures of the precursor sets 
for each chromosome in a population. 

Encoding Strategy. In our program each chromosome 
represents a different potential library. Each is therefore 
nominally split into a number of sections corresponding to the 
number of diversity sites. The number of genes in each section 
is equivalent to the number of available precursors for that 
position. The genes are binary, a 1 indicating that the 
precursor is to be used at that diversity site and a 0 that it is 
not. A gene is said to be set if its value is 1 and unset if it is
0. A simple example of this representation is shown in Figure
1, in which there are three sites of diversity, with R1 having
four suitable precursors, A-D; R2 having two, E and F; and
R3 having three, G-I. This requires a 9-bit chromosome with
the first 4 bits representing the first diversity site and 
specifically bit 1 representing the presence or absence of 
precursor A in the library represented by that chromosome. 
It is important to note that given only the gene positions of 
the section boundaries, it is possible to decode a chromosome 
to give the precursor combinations in the products that are 
required by the fitness function. 
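
The decoding step can be illustrated with the nine-gene example above. This is a minimal sketch in Python, not the authors' C code; the function name `decode` and the particular bit pattern are hypothetical.

```python
from itertools import product

def decode(chromosome, sections):
    """Return the selected precursors per site and the enumerated products.

    chromosome: list of 0/1 genes.
    sections:   one list of precursor names per diversity site, in
                chromosome order (their lengths give the section boundaries).
    """
    selected, start = [], 0
    for names in sections:
        genes = chromosome[start:start + len(names)]
        selected.append([name for name, bit in zip(names, genes) if bit])
        start += len(names)
    if start != len(chromosome):
        raise ValueError("chromosome length does not match section sizes")
    # Every selected precursor at one site is combined with every
    # selected precursor at the other sites.
    return selected, list(product(*selected))

sections = [["A", "B", "C", "D"], ["E", "F"], ["G", "H", "I"]]
chromosome = [1, 0, 1, 0, 0, 1, 1, 0, 1]  # hypothetical bit pattern
selected, products = decode(chromosome, sections)
print(selected)   # precursors chosen at R1, R2, R3
print(products)   # all product combinations in the encoded library
```

Only the section lengths are needed to locate the boundaries, which is why the chromosome can remain a flat bit string.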



Substituents that are specified to be included in all solutions 
are not allocated a gene position in the chromosome, since the 
chromosome represents choices that can be made. However,
they are included when enumerating the substituent combinations
of the library products during the fitness function
evaluation.

In some library syntheses, a set of precursors is added to
more than one position simultaneously, and therefore the same
precursor set must be used at each of these positions. The
program allows one or more diversity positions to be fixed as 
equal to another. In this case the repeated precursor set(s) 
is (are) not represented in the chromosome but included in the 
enumeration of products during the fitness function evaluation. 

It should be noted that this encoding is different from other 
genetic algorithm applications to combinatorial chemistry 
which have been reported recently.21-23 In those studies a
single library product is encoded in each chromosome and a 
population of individual library products is optimized rather 
than a population of complete libraries. 

A step-by-step procedure for the GALOPED program is as 
follows: 

1. Input substituent sets, diversity measurements, and 
design criteria. 

2. Generate an initial population of chromosomes by 
random initialization. 

3. Calculate the fitness for each chromosome. 

4. Create the mating pool by biased random selection of 
parents. 

5. Create children by crossover and mutation of two 
randomly selected parents from the mating pool. 

6. Calculate the fitness of the children. 

7. Insert (fitter) children into population by displacing 
(weaker) parents. 

8. If the maximum number of generations has not been
reached and there have been improvements in fitness in the last
n generations, go to step 3.

9. Display results. 

Each stage of the above procedure will now be described in 
more detail. 

Step 1. Program Input. When different structural classes 
of precursor are being used at the same diversity site (for 
example, both carboxylic acids and acid chlorides might be 
compatible with the chemistry at a given R position), it is 
necessary to transform the precursors into substituents so that 
the correct molecular weights can be calculated and correct 
pharmacophore point types assigned. A substituent is the
form in which a precursor will occur in the library products.
A number of programs may be used to achieve this
transformation.24-26

After producing the substituent lists, the necessary diversity 
calculations must be performed. If diversity of the precursor
sets is to be considered, the substituents to be used at each
diversity site are separately grouped into 2D clusters or 3D 
families. We do not currently combine 2D and 3D descriptors. 

2D clusters are produced using MACCS structural keys 
followed by Ward's agglomerative clustering.10 Ward's clustering
makes use of pairwise Euclidean distances calculated from
the MACCS keys. We have conducted extensive validation 
studies on these methods, looking at their ability to separate 
known active and inactive structures in a number of datasets. 
We have shown the methods to be able to achieve good 
separations of structures into biohomogeneous clusters, i.e., 
those containing mostly structures of one activity class. 10 
Others have reached the same conclusions studying the same
types of descriptor. 27 

3D families are produced by identifying all potential phar- 
macophore points in each precursor and then grouping to- 
gether those having identical patterns of all pharmacophore 
points, within a given distance tolerance, in the single conformation
produced by the CONCORD program.8,28 This is a
partitioning procedure, rather than clustering, since every 
compound with a particular pharmacophore pattern is as- 
signed to the same family. 

These cluster or family numbers are used by the fitness 
function to assess the diversity of the subsets of precursors 
selected for a particular library, since each 2D cluster or 3D 
family contains structures with a common set of features. Our
validation studies have suggested that by selecting one from 
each cluster or family, there is a good chance that all activity 
classes present in the dataset as a whole will be represented 
in the selected set. 

If diversity is to be assessed over the product structures, 
then all possible product structures are enumerated. For all 
libraries of nontrivial size the 3D grouping procedure is too 
slow to be used. Therefore, clusters are produced using 
MACCS keys and Ward's agglomerative clustering. Due to 
the time requirements of the clustering algorithm, there is a 
practical limit of around 200 000 library products using this 
method. As an alternative we are currently investigating the 
use of cell membership numbers produced using the Diverse- 
Solutions software 11 which, initial experiments indicate, should 
allow sets of several million structures to be processed. 

Input to the program consists of a list of named substituents 
with structures in SMILES format, each with an indication 
of its R position, group number, and whether it must be 
included in all suggested solutions. The substituent structures 
are necessary to allow browsing of the chosen sets once the 
design is finished. Properties of the products, such as cluster 
or cell numbers, are indexed using composite names formed 
from concatenation of the precursor names to allow their 
lookup by the fitness function. A number of other parameters 
that must be defined by the user are discussed in subsequent 
sections. 

Step 2. Initializing the Population. A fixed size library 
design implies that a fixed number of genes must be set in 
each section of the chromosome in the final solution. To allow 
this, user-defined upper and lower bounds may be set on the 
number of genes within each section. A chromosome gener- 
ated by any genetic operator that breaks these bounds is 
disallowed. Once the lower bound is set to the required 
number of substituents, the fitness function has the effect of 
driving the chromosomes toward solutions containing this 
number of genes per section, since the simplest way to relieve 
excess redundancies is to use fewer substituents. Chromo- 
somes with more than the minimum number of bits but less 
than the upper bound are permitted in the population to allow 
the operators to produce more valid chromosomes than would 
otherwise be possible and to allow more diversity among the 
solutions to be maintained while the population is being 
optimized. 

For library designs with no fixed sizes, the upper and lower 
bounds are useful to reduce the search space. This they do 
by preventing consideration of solutions that greatly exceed 
the frequency cutoffs or have so few substituents as to be 
uninteresting. In this case the requirement to include as many 
bits as possible prevents the fitness function from driving the 
number of genes toward the lower bound. 

Each chromosome may be initialized by picking a different 
random number of genes between the upper and lower bounds 
to be set in each section of each chromosome and then 
randomly selecting gene positions until that number of genes 
has been set. 
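
A minimal Python sketch of this random initialization follows. The function names are hypothetical, and the section sizes and bounds are taken from the library I example purely for illustration.

```python
import random

def init_section(n_genes, lower, upper, rng):
    """Set a random number of genes, between lower and upper, at random positions."""
    n_set = rng.randint(lower, upper)          # inclusive bounds
    section = [0] * n_genes
    for pos in rng.sample(range(n_genes), n_set):
        section[pos] = 1
    return section

def init_chromosome(section_sizes, bounds, rng):
    """Concatenate independently initialized sections into one chromosome."""
    chromosome = []
    for n_genes, (lower, upper) in zip(section_sizes, bounds):
        chromosome.extend(init_section(n_genes, lower, upper, rng))
    return chromosome

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [
    init_chromosome([360, 259], [(100, 150), (100, 150)], rng)
    for _ in range(100)
]
print(len(population), len(population[0]))
```

Sampling positions without replacement is equivalent to picking random gene positions until the required number has been set.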

A reviewer of this paper suggested that there may be an 
advantage to biasing the initialization toward diverse substit- 
uents, in cases in which the diversity of the substituents rather 
than the products is being optimized. This has been imple- 
mented by selecting a random number of genes between the 
upper and lower bounds, as before. Gene positions are again 
selected at random to be set, but this time a gene position is 
disallowed if another representing a substituent from the same 
cluster or family has already been set. If this is the case, it is 
left unset and the next random choice examined. A population 
size of around 100 is typically used. 
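
The diversity-biased variant can be sketched as below, again in Python with hypothetical names. One assumption is made that the text does not specify: if the section runs out of unused clusters or families before the target count is reached, the section is simply left with fewer genes set.

```python
import random

def init_section_diverse(groups, lower, upper, rng):
    """Randomly set genes, skipping any whose cluster/family is already used.

    groups[i] is the cluster or family number of precursor i in this section.
    """
    n_target = rng.randint(lower, upper)
    section = [0] * len(groups)
    used_groups = set()
    n_set = 0
    positions = list(range(len(groups)))
    rng.shuffle(positions)                 # examine positions in random order
    for pos in positions:
        if n_set == n_target:
            break
        if groups[pos] in used_groups:
            continue                       # leave unset, examine next choice
        section[pos] = 1
        used_groups.add(groups[pos])
        n_set += 1
    return section

rng = random.Random(1)
groups = [i // 2 for i in range(40)]       # 20 hypothetical families of two
section = init_section_diverse(groups, 5, 10, rng)
print(sum(section), "genes set, one per family")
```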

Step 3. Fitness Function. The fitness function consists 
of a number of parts, each of which optimizes a different aspect 
of the library: the molecular weight/formulae redundancy, the 
substituent molecular weight redundancies, the number of 
compounds, and the diversity. 

The method of evaluating the redundancy of the library is
dependent on the type of mass spectroscopy to be used in the
assay stage. For low-resolution MS, molecular weights are
treated as integral. For a high-resolution experiment, it is
considered sufficient that the molecular formulae of the
products not be redundant, so the number of occurrences of
each unique formula is counted in this case.

Figure 2. Calculation of excess frequency of molecular weight
for scoring a chromosome. A frequency distribution of the
number of occurrences of library products with each molecular
weight is calculated. The total number of occurrences over a
user-defined cutoff is calculated across the distribution.

To evaluate the redundancy among molecular weights (or
molecular formulae), the chromosome must first be decoded
to identify the library products. The library products encoded
in a chromosome may be enumerated by taking the substituents
corresponding to every set gene in the chromosome
section for R1 in combination with every set gene in R2 in
combination with every set gene in R3, etc. If the library
synthesis strategy is to not mix the final position, then all
combinations are taken over the first N - 1 sections rather
than all N sections, since it is the frequency redundancies
within each pool that are of concern. The molecular weight
or formula for each substituent is calculated during the initial
data loading so that the products may be quickly calculated
by summing those of the relevant substituents; a constant may
be added to take account of a core, although this is not
necessary for the score. A frequency count of each unique
weight or formula is kept over all the library products. Note
that it is not necessary to construct the actual structures of
the library products but rather simply to sum together the
weights or formulae of the substituents.

To produce a redundancy score for the library, the total 
number of molecular weight redundancies over a user-defined 
cutoff is summed across the distribution, to give the excess 
frequency (see Figure 2). After some experimentation it was
decided to use the squares of the individual excess frequencies 
at each molecular weight/formula. This tends to flatten the 
whole frequency distribution by ensuring, for example, that a 
distribution in which two separate molecular weights have an 
occurrence one greater than the cutoff is more favorable than 
one in which one weight has an excess frequency of two and a 
second of zero. 
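
The effect of squaring the excess frequencies can be seen in a small worked example. This is a minimal Python sketch with hypothetical function names; integral weights are used, as for low-resolution MS.

```python
from collections import Counter
from itertools import product

def product_weights(site_weights, core=0):
    """Sum substituent weights over every combination of selected substituents."""
    return [core + sum(combo) for combo in product(*site_weights)]

def excess_frequency_score(values, cutoff, squared=True):
    """Total (optionally squared) number of occurrences above the cutoff."""
    counts = Counter(values)
    excesses = [n - cutoff for n in counts.values() if n > cutoff]
    return sum(e * e for e in excesses) if squared else sum(excesses)

# Two distributions with the same unsquared total excess frequency (cutoff 2):
flat = [300] * 3 + [310] * 3      # two weights, each one over the cutoff
peaked = [300] * 4 + [310] * 2    # one weight two over the cutoff

print(excess_frequency_score(flat, 2, squared=False),
      excess_frequency_score(peaked, 2, squared=False))
print(excess_frequency_score(flat, 2), excess_frequency_score(peaked, 2))
```

Both distributions have an unsquared excess of 2, but the squared score is 2 for the flatter distribution and 4 for the peaked one, so the flatter distribution is preferred.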

A score for the substituent molecular weight redundancies 
is produced in a similar fashion. For each product molecular 
weight a separate frequency distribution is calculated for all 
the substituents in the library compounds having that molecular
weight. A separate score for each of these, again based
on excess frequency over a second user-defined cutoff, is calculated
and the mean score taken over all the product molecular
weights.

The diversity of the library encoded in a single chromosome, 
when judged by the individual substituent sets, is assessed 
by considering the number of times each 3D family or 2D 
cluster occurs in each of the sections of the chromosome. These 
diversity calculations are conducted separately on each R 
position. However, it would also be possible to combine the 
precursor sets for all positions and cluster or partition on one
common chemical space. 

In the library design problems to which the 3D diversity 
measure has so far been applied, the number of families 
present in a set of precursors has always exceeded the number 
of substituents required in that position in the library design. 
In this case a simple diversity score can be produced for each 
section of the chromosome by imposing a penalty related to 
the number of repeated occurrences of any given family or
cluster. At best the number of families that could be represented
in a section of a chromosome is equal to the number of
genes set in that section, so the penalty score is based on the 
ratio of the number of repeated families to the number of bits 
set. The mean penalty score is taken across all sections to 
produce a diversity score for the chromosome. Scoring has to 
be normalized by the number of set genes in each to avoid 
biasing the diversity score by the number of precursors 
selected. Without this normalization the algorithm will tend 
to reduce family repeats by reducing the number of precursors 
selected. Note that if subsequent design problems arise in 
which there are fewer groups than required substituents, it 
will be straightforward to devise a score based on the evenness 
of the frequency distribution of group numbers across the 
selected substituent set. 
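
A minimal Python sketch of such a normalized penalty is given below. It assumes that "the number of repeated families" means the number of set genes minus the number of distinct families among them; the exact definition used in the program is not stated, and the function names are hypothetical.

```python
def section_penalty(section, groups):
    """Repeated family occurrences divided by the number of genes set.

    section: list of 0/1 genes for one diversity site.
    groups:  family or cluster number of each precursor in that section.
    """
    chosen = [g for bit, g in zip(section, groups) if bit]
    if not chosen:
        return 0.0
    repeats = len(chosen) - len(set(chosen))   # assumed definition of repeats
    return repeats / len(chosen)

def diversity_penalty(sections, groups_per_section):
    """Mean of the per-section penalties (0 means no family is repeated)."""
    penalties = [section_penalty(s, g)
                 for s, g in zip(sections, groups_per_section)]
    return sum(penalties) / len(penalties)

# Site 1 picks two members of family 1 and one of family 2; site 2 has no repeats.
sections = [[1, 1, 1, 0], [1, 1]]
groups = [[1, 1, 2, 3], [5, 6]]
print(diversity_penalty(sections, groups))
```

Dividing by the number of set genes is what stops the score from improving merely because fewer precursors are selected.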

If the objective is to assess the diversity of the products, a 
frequency distribution of the cluster numbers or cell numbers 
is constructed for all the library products represented in the 
chromosome. Given a user-defined acceptable cluster or cell 
redundancy, a score is calculated for the excess frequency using 
the method described above for the molecular weight distribu- 
tions. 

As was the case for library II, it is sometimes required to 
maximize the number of products in a library, while keeping 
the deconvolution problem within acceptable bounds. In this 
case a penalty score is calculated by taking the mean across 
all sections of the chromosome of the square of the number of 
unset genes in each section. 

Having produced a score for each factor in the optimization, 
a weighted mean of these component scores is taken to be 
returned as the fitness of the chromosome. The relative 
weights on each individual factor are user-defined and can be 
adjusted to place more emphasis on some aspect or aspects of 
a given design problem. 
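
The library-size penalty and the final weighted mean can be sketched as follows. This is an illustrative Python sketch under the assumption that every component is a penalty, so that lower values are better; the component names and weights shown are hypothetical.

```python
def size_penalty(sections):
    """Mean over sections of the squared number of unset genes."""
    return sum(section.count(0) ** 2 for section in sections) / len(sections)

def fitness(scores, weights):
    """Weighted mean of the component scores, keyed by component name."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(weights[name] * scores[name] for name in scores)
    return weighted_sum / total_weight

sections = [[1, 0, 0], [1, 1]]          # two unset genes, then none
scores = {
    "size": size_penalty(sections),     # (2**2 + 0**2) / 2
    "product_tef": 1.0,                 # hypothetical component value
}
weights = {"size": 1, "product_tef": 3}  # hypothetical user-defined weights
print(fitness(scores, weights))
```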

Step 4. Selection. A mating pool of size equal to the 
required number of replacements for a generation is created 
by roulette wheel selection, giving a greater chance to more 
fit members of the population to enter the mating pool. Rather 
than use the raw scores for this selection, our preliminary 
experiments showed it to be more effective to base the selection 
on rank. 29 The population is ranked according to the scores, 
and selection for entry into the mating pool is based on a 
function of the rank rather than the score itself. Experiment 
showed both linear and nonlinear rankings to be equally 
useful. In the former the normalized fitness is a linear 
function of the rank; in the latter the spacing is geometric. 
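
Rank-based roulette wheel selection can be sketched as below. This is a minimal Python illustration with hypothetical names; it assumes that a lower score means a fitter chromosome, and the geometric ratio of 0.9 is an arbitrary illustrative value.

```python
import random

def rank_weights(n, scheme="linear", ratio=0.9):
    """Selection weights for ranks 0 (best) to n - 1 (worst)."""
    if scheme == "linear":
        return [n - rank for rank in range(n)]
    return [ratio ** rank for rank in range(n)]   # geometric spacing

def select_mating_pool(population, scores, pool_size, rng, scheme="linear"):
    """Roulette wheel selection on rank rather than on raw score."""
    ranked = [chrom for _, chrom in
              sorted(zip(scores, population), key=lambda pair: pair[0])]
    weights = rank_weights(len(ranked), scheme)
    return rng.choices(ranked, weights=weights, k=pool_size)

rng = random.Random(0)
population = ["a", "b", "c", "d"]            # stand-ins for chromosomes
scores = [0.1, 0.5, 0.9, 2.0]                # lower is fitter (assumption)
pool = select_mating_pool(population, scores, 2000, rng)
print({name: pool.count(name) for name in population})
```

Because only the rank enters the weights, a single chromosome with an extreme raw score cannot dominate the mating pool.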

Step 5. Reproduction. Crossover is the genetic algorithm 
operator which swaps genetic material between parents to 
produce children. Two parents are first selected at random 
from the mating pool to produce children. For each gene in 
child 1, uniform crossover (ref 30) selects randomly whether that gene
will be inherited from parent 1 or parent 2; child 2 inherits
the corresponding genes from the other parent. Two-point
crossover selects two gene positions in the parents. Child 1
inherits the genes between these positions from parent 2 and
outside them from parent 1. Once again, child 2 is the
complement of child 1. In preliminary experiments uniform
crossover was generally found to be more successful than two-point
crossover. A crossover rate of around 0.8 is used,
meaning that having selected two parents there is an 80% 
chance that crossover will be applied and a 20% chance that 
the parents will pass into the child pool directly. Mutation 
by simple inversion is applied to the children produced. Genes 
are selected at random, typically at a rate of around 1 or 2 
genes/chromosome, and the bit flipped from 1 to 0 or vice versa. 
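
The operators described in this step can be sketched in Python as follows. The names and the default of 1.5 mutated genes per chromosome are illustrative assumptions, and the check that rejects children breaking the per-section bounds is omitted for brevity.

```python
import random

def uniform_crossover(p1, p2, rng):
    """Each gene of child 1 comes from either parent; child 2 gets the other."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < 0.5:
            c1.append(g1)
            c2.append(g2)
        else:
            c1.append(g2)
            c2.append(g1)
    return c1, c2

def two_point_crossover(p1, p2, rng):
    """Swap the genes lying between two randomly chosen cut points."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]

def mutate(chromosome, rng, genes_per_chromosome=1.5):
    """Flip each bit with a probability giving the stated average per chromosome."""
    rate = genes_per_chromosome / len(chromosome)
    return [1 - g if rng.random() < rate else g for g in chromosome]

def reproduce(p1, p2, rng, crossover_rate=0.8):
    """Apply crossover with the given probability, then mutate both children."""
    if rng.random() < crossover_rate:
        c1, c2 = uniform_crossover(p1, p2, rng)
    else:
        c1, c2 = list(p1), list(p2)   # parents pass into the child pool directly
    return mutate(c1, rng), mutate(c2, rng)

rng = random.Random(0)
child_1, child_2 = reproduce([0] * 10, [1] * 10, rng)
print(child_1, child_2)
```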

Step 7. Replacement. The genetic algorithm has been
run using both generational replacement and steady-state
replacement. In the former, one parent generation gives rise
to a whole child generation, and the parent generation is then
either conditionally or unconditionally replaced by its children.
In the latter case, as soon as a single child is produced, it is
then conditionally or unconditionally inserted into the generation,
and so genetic material from the child is immediately
available to influence the production of the next child, which
it would not be in the former. Our initial experiments showed
that while generational replacement occasionally gives better
solutions, the steady-state method gives much faster convergence
and answers which are never too much worse than
generational replacement.

Experiment showed conditional replacement to be most 
effective; that is, child chromosomes only enter the population 
if they are an improvement on existing members of the 
population, meaning that weaker members are replaced at 
each stage. Elitism is used to unconditionally carry forward 
the best member of each generation into the next. 
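
Conditional steady-state replacement can be sketched as below. This is an illustrative Python sketch with a hypothetical function name, assuming that lower scores are fitter and that the child displaces the current weakest member; the text does not say which weaker member is displaced.

```python
def insert_child(population, scores, child, child_score):
    """Insert the child only if it beats the weakest member (lower is fitter).

    Returns True if the child entered the population.
    """
    worst = max(range(len(scores)), key=scores.__getitem__)
    if child_score < scores[worst]:
        population[worst] = child
        scores[worst] = child_score
        return True
    return False

population = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
scores = [4.0, 9.0, 1.0]
accepted = insert_child(population, scores, [1, 1, 1], 3.0)
rejected = insert_child(population, scores, [0, 0, 1], 7.0)
print(accepted, rejected, scores)
```

In this sketch the best member is never the one displaced (for populations of more than one), so it is carried forward without a separate elitism step.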

Library Design Results 

Library I, discussed in the introduction and shown
in Scheme 1, had two sites of diversity. The first
substituent position, R1, was produced from carboxylic
acids, chloroformates, and amines and the second, R2,
from carboxylic acids, aldehydes, and carbamoyl chlorides.
A search of the Available Chemicals Directory
(ACD)31 produced 360 precursors compatible with the
chemistry for R1 and 259 for R2. There are therefore
93 240 possible library product compounds. The design
of this library called for the production of 10 000
compounds by combining 100 R1s with 100 R2s. Deconvolution
was to be by high-resolution experiment;
therefore, the minimum redundancy was required for
every unique molecular formula. Diversity was assessed
on the precursor sets, and so precursors were
transformed into substituents, and 3D families, based
on common patterns of 3D pharmacophore points, were
precomputed for each R position separately, yielding 203
families for R1 and 155 families for R2.

The following results are for the genetic algorithm 
running in steady-state mode with a population of 100. 
The chromosome was defined to allow 100-150 genes
for both R1 and R2. The frequency cutoff was set at 2;
i.e., a frequency of 3 contributes 1 to the total excess
frequency. Equal weight was placed on diversity and
deconvolution in the fitness function. Random initialization
was used for the chromosomes.

Figure 3 shows the variation of total excess frequency
(TEF) and the total number of families present for the
best chromosome over a typical run. Note that the TEFs
quoted in this section are not squared although the
squared value was used by the fitness function. Figure
4 shows the molecular formula frequency distributions
for generations 1 and 1000 and at convergence, which
for this run was after 392 000 generations. On average,
chromosomes with fewer bits have a lower TEF score, so these
come to dominate the population after the first few
thousand generations. Following rapid improvement to
the molecular formula frequency distribution, the remainder of
the run produces small improvements in the TEF while
increasing the diversity of the library in terms of the
number of different families represented by the 100
precursors at each diversity site. Over all 17 000 unique
molecular formulae, only 23 exceed the cutoff after the
program converges.

Figure 3. Evolution of library I. The total excess frequency
(TEF) of molecular formulae and the total number of 3D
families present in the precursor sets for R1 and R2 at each
generation are shown.

Figure 4. Evolution of library I. The molecular formula
frequency distribution is given for the best chromosome in the
first, one-thousandth, and last generations.

Table 1. Variation in the Components of the Fitness Function,
Total Excess Frequency (TEF) of Molecular Formulae and
Number of 3D Families, over 10 Runs for Library I, Resulting
from the Stochastic Nature of the Genetic Algorithm(a)

               families
run    TEF     R1     R2
  1     23     78     84
  2     25     81     77
  3     20     81     80
  4     25     79     77
  5     22     80     78
  6     26     74     78
  7     23     79     79
  8     24     81     79
  9     25     81     80
 10     23     81     80

(a) Only the seed to the random number generator is changed
between runs.

Table 2. Variation in the Total Excess Frequency (TEF) of
Molecular Weight and Number of 3D Families over Six Runs of
Library I, Using Initialization of the Population for Maximum
Diversity

               families
run    TEF     R1     R2
  1     32     87     86
  2     28     84     83
  3     27     90     82
  4     33     83     86
  5     36     85     86
  6     28     88     83

The typical run time for the program, using a Silicon
Graphics Indigo2 running at 250 MHz, is approximately
0.05 s CPU/generation in steady-state mode, of which
over 0.04 s CPU is spent in the fitness function. The
total time to convergence for this library was ap- 
proximately 5.5 h. Examination of Figure 3 suggests 
that reasonable approximate solutions can be obtained 
in a much shorter time since the most rapid improve- 
ment to the score comes early in the run. Furthermore, 
in cases where no good solution can be found, possibly 
because the frequency cutoffs have been set unrealisti- 
cally low, this is often apparent early in the run when 
little improvement will be made during the first genera- 
tions. In this case the program can be terminated and 
rerun with adjustments to the GA parameters and 
frequency cutoffs. 

Since the GA method is stochastic, rerunning the 
program with the same parameters will produce differ- 
ent results. As long as the program is not converging 
prematurely, it should be expected that the final solu- 
tions will have approximately equal fitness scores, 
however. Table 1 shows the TEF and number of 
families for the best chromosome at convergence for 10 
runs of the program varying only the seed to the random 
number generator. The table shows that there is little 
variation in the TEF (between 20 and 26) or in the total 
number of families represented in R1 (74-81) or R2 (77-84).

The effects of initializing the chromosomes for maxi- 
mum diversity rather than entirely at random are 
shown in Table 2. These runs have produced a similar 
outcome, although with more diversity at the expense 
of a slightly worse deconvolution profile. Typically in 
the first generation of these runs, the TEF scores were 
in the 3000-5000 range, and so the final TEFs of these 
runs and those with the random initialization are seen 
to be very similar. The number of generations to 
convergence was similar in both sets of runs. Finding 
similar outcomes with two different types of starting 
points is a further indication that the GA is not
converging prematurely on solutions far from the global 
minimum. 

Since the chromosomes are initialized for maximum 
diversity in these cases and only changed to satisfy the 
molecular formula considerations, it can be seen that
approximately 10-20% of the possible diversity at each 
R position has had to be sacrificed to satisfy the 
deconvolution constraint. 

For library II (see Scheme 2), three sites of diversity 
were available on a small molecule core. Positions R1
and R3 were derived from aldehydes and R2 from amino
acids. In the existing inventory of precursors available 
within Abbott at the time at which the library was con- 
structed, there were 53 suitable candidates for the alde- 
hydes and 42 for the amino acids. The library was not 
to be of a predetermined size, but instead, a maximum 
allowed redundancy of 100 was specified for any one 
molecular weight and the largest possible size library 
required. In addition, within any one product molecular 
weight a limit of 10 redundancies at any one substituent 
weight was imposed. The precursors had already been 
selected for diversity so there was no need to optimize 
the number of families present in this case. After ex- 
periment, weights were picked for the different parts of 
the scoring function to ensure that the TEF was con- 
sistently at or very close to 0 and that the sum of the 
monomer TEFs was always 0. These weights were 10" 
2: 1 for monomer TEF:products TEF:library size. Upper 
and lower bounds of 10 and 50 monomers were applied 
to positions 1 and 3 and 10 and 42 to position 2. 
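
The way such a weighted score might be assembled can be sketched in Python. This is a minimal illustration, not the GALOPED code. It assumes that a TEF is the sum of the amounts by which each frequency exceeds its cutoff, that integer substituent weights simply add to give the product weight (the core contributes a constant and is omitted), and that the score is minimized. The names excess and score_library and the three w_* weights are hypothetical placeholders, not the published values.

```python
from collections import Counter, defaultdict
from itertools import product


def excess(counts, cutoff):
    """Total excess frequency: summed overshoot of each count above the cutoff."""
    return sum(max(0, c - cutoff) for c in counts)


def score_library(positions, product_cutoff=100, monomer_cutoff=10,
                  w_monomer=100.0, w_product=10.0, w_size=1.0):
    """Score one candidate library; lower is better (an assumption).

    positions: one list of integer substituent weights per diversity site.
    The w_* weights are illustrative placeholders only.
    """
    product_counts = Counter()
    # product weight -> Counter of (site, substituent weight) occurrences
    monomer_counts = defaultdict(Counter)
    for combo in product(*positions):
        mw = sum(combo)
        product_counts[mw] += 1
        for site, weight in enumerate(combo):
            monomer_counts[mw][(site, weight)] += 1

    product_tef = excess(product_counts.values(), product_cutoff)
    monomer_tef = sum(excess(c.values(), monomer_cutoff)
                      for c in monomer_counts.values())
    size = sum(product_counts.values())
    # Redundancy is penalized; a larger library is rewarded.
    score = w_monomer * monomer_tef + w_product * product_tef - w_size * size
    return score, product_tef, monomer_tef, size
```

For two sites with weights [100, 101] and [200, 201] and both cutoffs set to 1, the product weight 301 occurs twice, so the product TEF is 1, the monomer TEF is 0, and the library size is 4.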

Figure 5 shows the molecular weight distribution of 
the best chromosome at generation 1 and at conver- 
gence. Since some chromosomes were generated with 
very few bits, some libraries in generation 1 have an 
extremely low TEF; however, they contain very few 
compounds and so have a poor score due to this latter 
factor. Table 3 shows the variation in TEF and total
number of compounds in the library over 10 runs
varying only the seed to the random number generator.
The results suggest the optimal library size to be
between 21 672 and 24 300. In addition, it can be seen
that the best results are always obtained using a large
number of precursors at R1 and R3 relative to the total
available (43-45 of 53) and a much smaller number at
R2 (11 or 12 of 42). This is an indication that there is
most molecular weight redundancy in the precursor set
for R2 and that the best way to produce a larger library
while staying within the deconvolution limits would be
to identify more candidate precursors for R2.

A final example, library III, required evaluation of the 
diversity of library products rather than the substituent 
sets. Two diversity sites were available on an asymmetric
core with the same set of 315 acids from which
to choose at both positions. The chemistry shown in 
Scheme 1 was used to construct this library. The library 
size was to be 100 x 100. The 99 225 possible library 
products had 13 839 unique molecular formulae and 
were clustered, using Ward's agglomerative clustering 
and MACCS 2D fingerprints, into 10 000 clusters so that 
the most diverse set possible would have no more than 
one product from each cluster. In the fitness function 
equal weight was placed on the cluster and molecular 
formula redundancies. Initialization was by random 
assignment. Figure 6 shows the frequency of occurrence 
Figure 5. Integral molecular weight frequency distribution
of the best chromosome in the first and last generations for
library II.

Table 3. Variation in the Number of Precursors at Each
Diversity Position and the Total Number of Compounds in the
Library over 10 Runs for Library II^a

run    R1    R2    R3    total compounds
 1     45    12    45    24 300
 2     44    11    45    21 780
 3     43    12    42    21 672
 4     43    12    44    22 704
 5     43    12    45    23 220
 6     45    11    46    22 770
 7     44    12    43    22 704
 8     43    12    43    22 188
 9     43    12    44    22 704
10     44    12    44    23 232

^a Only the seed to the random number generator varies between
the runs.



of each cluster number in the products for the best 
chromosome in generation 1 and at convergence, applying
a redundancy cutoff of 4. Figure 7 shows the
molecular formulae redundancies for the same chromosome,
applying a cutoff of 3. Table 4 shows frequency
counts for the total number of occurrences of each 
cluster size. Between generation 1 and convergence the 
Figure 6. Cluster frequency distribution for the library
products in the best chromosome of the first and last generations
for library III.

Table 4. Evolution of the Diversity Score for Library III^a

cluster size    generation 1    convergence
0               4794            3835
1               2017            3404
2               1689            1913
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784 
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11 
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13 
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0 
solution encoded by the best chromosome in the first and last
generations. Larger clusters indicate more redundancy among the 
library products. Under-representation of small clusters indicates 
a lack of coverage. 

number of clusters represented many times is greatly 
reduced, while the number of clusters represented 
between zero and two times is increased, indicating an 
increase in diversity and also coverage. At convergence 
there are 3835 clusters not covered by compounds in 
the library and 5860 unrepresented molecular formulae. 
The total diversity that has had to be sacrificed both to 
the combinatorial constraint and to the deconvolution 
criteria is therefore seen to be just under 40% of the 
theoretical maximum (i.e., 3800 of 10 000 clusters). The 
variation in the two TEF statistics over five runs is 
shown in Table 5. The first run in this table is the one 
discussed above. Once again there is little variation in 
Figure 7. Molecular formula frequency distribution for the
best chromosome in the first and last generations for library III.

Table 5. Variation in the Total Excess Frequency (TEF) of the
Clusters and Molecular Formulae over Five Runs of Library III

              TEF
run    cluster    molecular formula
1      10         13
2      20         15
3      14         13
4      11         12
5      17         13



the overall diversity or molecular formula redundancy 
among the solutions. 
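
The frequency counts discussed for Table 4 can be illustrated with a short Python sketch. It is not taken from the GALOPED program: the function name cluster_size_profile is hypothetical, and the TEF is assumed to be the summed overshoot of each cluster's occupancy above the cutoff. Given the cluster number of every product in a library, it reports how many clusters are represented zero, one, two, ... times, how many are left uncovered, and the cluster TEF.

```python
from collections import Counter


def cluster_size_profile(product_clusters, n_clusters, cutoff):
    """Summarize cluster coverage and redundancy for one library.

    product_clusters: the cluster number of each library product.
    n_clusters: total number of clusters defined over all possible products.
    cutoff: occupancy above which a cluster counts as redundant.
    """
    per_cluster = Counter(product_clusters)      # cluster -> number of products
    histogram = Counter(per_cluster.values())    # occupancy -> number of clusters
    uncovered = n_clusters - len(per_cluster)    # clusters with no product
    histogram[0] = uncovered
    tef = sum(max(0, n - cutoff) for n in per_cluster.values())
    return dict(sorted(histogram.items())), uncovered, tef
```

With the figures quoted in the text, 3835 uncovered clusters out of 10 000 corresponds to an uncovered fraction of 3835 / 10 000, that is, just under 40%.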

Since the product diversity is being assessed rather 
than that of the substituents, it is not possible to 
initialize the chromosomes for maximum diversity. 
However, an initialization biased toward diversity was 
investigated in which the substituents were clustered, 
using MACCS keys and Ward's clustering, and chro- 
mosomes initialized by selecting no more than one 
member of any cluster in any chromosome. Over five 
runs there was no difference between the final product 
cluster TEFs or molecular formula TEFs for these runs 
and for those which started with random initialization. 

Discussion 

The GALOPED method permits the design of combi- 
natorial mixture libraries that simultaneously meet a 
number of design criteria. Efficient deconvolution of 
active compounds by mass spectroscopy is allowed for 
by designing libraries with the minimum redundancy 
of molecular weights or molecular formulae and optionally
minimum redundancy among substituent molecular
weights at any one product molecular weight. Diversity
among the library products is either accounted for 
directly from product cluster numbers or based on a 
selection of precursors that maximizes the number of 
different pharmacophore point patterns present among 
each set of substituents. There are several advantages 
to the genetic algorithm method: 

It is very fast and needs only to examine a tiny
fraction of all potential libraries to find satisfactory
solutions. In library I, for example, under 400 000 of
the possible 10^164 solutions are considered. It is, of
course, not possible to say how close the program may
come to an optimal solution since this would require the
enumeration of all possible libraries. However, it seems
clear from examining the solutions present in the initial
population of any run that it would not be possible to
find good solutions simply by randomly generating
possible libraries. Given the types of chemistry currently
being developed for solid support and the resources
of the Available Chemicals Directory, it is easy
to imagine that much larger candidate sets of precursors
than those given in the examples in this paper may need
to be considered. For example, a selection from 1000 x
1000 x 1000 choices may not be unrealistic. This
increase in size on its own should not impose a limitation
on the method. Indeed, for a fixed final library size,
more candidates at each position might allow for more
choices when attempting to maximize the diversity.
Other factors will be more important to the speed of
convergence of the algorithm. The size of the final
design library will affect the speed of the fitness function
since all possible substituent combinations have to be
enumerated and evaluated. A second factor will be the
amount of variability available in the products. If
unrealistic cutoffs for cluster and molecular weight
redundancy are set in relation to the molecular weight
and cluster profiles of all potential library products, then
convergence on good solutions will be much more
computationally difficult to achieve.
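
The scale of the search space for library I can be checked with a few lines of Python. The sketch below uses the candidate and selection counts given earlier for library I (100 of 360 precursors at R1 and 100 of 259 at R2). The function name n_libraries is hypothetical, and the figure of 400 000 libraries examined is the upper bound quoted above.

```python
from math import comb


def n_libraries(candidates, picks):
    """Number of distinct libraries when picks[i] precursors are chosen
    from candidates[i] at each diversity position."""
    total = 1
    for n, k in zip(candidates, picks):
        total *= comb(n, k)
    return total


library_one = n_libraries([360, 259], [100, 100])
exponent = len(str(library_one)) - 1          # order of magnitude, base 10
fraction_examined = 400_000 / library_one     # upper bound on the fraction searched

print(f"possible libraries: about 10^{exponent}")
print(f"fraction examined: {fraction_examined:.1e}")
```

Python integers have arbitrary precision, so the count is exact and the order of magnitude can be read directly from the number of digits.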

The method is able to optimize a number of often
competing factors simultaneously, and it is readily
extensible to include new design factors. The fitness
function is independent of the searching algorithm and
is the only part of the program which has knowledge of
the problem domain. This means that it is straightforward
to modify the former without affecting the latter.
We have demonstrated optimizing the diversity of either
the precursor sets or the product structures. However,
any other properties of either could equally well be
considered instead of, or in addition to, diversity. One
can imagine many properties which might be optimized,
for example, physical properties such as log P or the
calculated affinity of the library products for a particular
receptor. In addition, it is equally simple to write
functions to maximize the spread of properties or confine
them to a required range. The only restriction to the
use of library product properties is that there must be
sufficiently few that they can be enumerated and the
properties calculated. While one could imagine calculating
product properties on-the-fly for only those
structures which are included in any chromosome,
analysis of many GALOPED runs suggests that almost
all possible library products will be examined at least
once. There would therefore be little to be gained from
Table 7. Number of Precursors in Common between the Best
Solution from Five Separate Runs for Library III
49
R1, below the diagonal to precursors for R2.

on-the-fly enumeration for our current examples. For 
much larger libraries it may be that a lower percentage 
of all possible library products would appear in any 
chromosome, in which case on-the-fly enumeration may 
become advantageous. However, the speed of enumera- 
tion of the required properties would be a limiting factor 
on their use with very large virtual libraries, and we 
may have to be content with simultaneously optimizing 
properties of the precursor sets in these cases.
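
The separation between the search procedure and the fitness function can be illustrated with a generic sketch. The Python code below is a stand-in under stated assumptions, not the GALOPED implementation: the selection scheme (truncation with elitism), one-point crossover, bit-flip mutation, and all parameter values are illustrative choices, and the names evolve and distance_from_target_size are hypothetical. The point is only that evolve never inspects what a chromosome means; all domain knowledge lives in the callable passed as fitness.

```python
import random
from typing import Callable, List, Sequence

Chromosome = List[int]


def evolve(fitness: Callable[[Sequence[int]], float], n_bits: int,
           pop_size: int = 30, generations: int = 60,
           mutation_rate: float = 0.02, seed: int = 0) -> Chromosome:
    """Minimize `fitness` over bit strings of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]            # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit with probability mutation_rate
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)


def distance_from_target_size(chromosome: Sequence[int], target: int = 5) -> float:
    """Toy fitness: how far the number of selected precursors is from a target."""
    return abs(sum(chromosome) - target)


best = evolve(distance_from_target_size, n_bits=20)
```

Swapping in a different design criterion means writing another function with the same signature and passing it to evolve; the search code itself is left untouched.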

The method can provide many different solutions to 
a problem from which the combinatorial chemist can 
choose. These might be either different chromosomes
from the same population or the best chromosome from
many populations obtained by running the program several
times. Table 6 shows, for the best five different chromosomes
in a single run of library III, the number of
precursors in common between each pair of solutions.
Above the diagonal are the values for position R1, below
the diagonal for R2. The suggested libraries differ
by only one or two precursors, showing that the
solutions in a single run represent slight variations on
a common theme. Table 7 shows the equivalent results
for the best chromosome over five separate runs of
library III. In contrast to the results in Table 6, this
has resulted in the production of very different solutions,
albeit with approximately equal fitness. In either
position, any pair of solutions has only approximately
one-half of the precursor sets in common. Structures
of the precursor sets selected in these five runs are 
provided as Supporting Information. 
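
The pairwise comparison reported in Tables 6 and 7 amounts to counting shared precursors between every pair of solutions at a given position. A minimal Python sketch follows; the function name common_precursors and the precursor labels in the usage note are hypothetical.

```python
from itertools import combinations


def common_precursors(solutions):
    """Map each pair of solution indices (i, j), i < j, to the number of
    precursors the two solutions share at one diversity position."""
    sets = [set(s) for s in solutions]
    return {(i, j): len(sets[i] & sets[j])
            for i, j in combinations(range(len(sets)), 2)}
```

For example, common_precursors([["a", "b", "c"], ["b", "c", "d"], ["x", "y", "z"]]) gives 2 for the pair (0, 1) and 0 for the other two pairs. Running it once per diversity position gives the values for the two halves of a table such as Table 6.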

Conclusions 

The method we have described in this article allows 
us to design combinatorial mixture libraries optimized 
for biological screening for lead identification. It also 
provides the framework for the design of libraries to 
meet any calculable optimization function. 

Supporting Information Available: Chemical structures
of the precursor sets selected in the five runs referred to in Table
7 (VU pages). Ordering information is given on any current
masthead page.